Optimising brain age estimation through transfer learning: A suite of pre‐trained foundation models for improved performance and generalisability in a clinical setting

Abstract
Estimated age from brain MRI data has emerged as a promising biomarker of neurological health. However, the absence of large, diverse, and clinically representative training datasets, along with the complexity of managing heterogeneous MRI data, presents significant barriers to the development of accurate and generalisable models appropriate for clinical use. Here, we present a deep learning framework trained on routine clinical data (N up to 18,890, age range 18–96 years). We trained five separate models for accurate brain age prediction (all with mean absolute error ≤4.0 years, R² ≥ .86) across five different MRI sequences (T2‐weighted, T2‐FLAIR, T1‐weighted, diffusion‐weighted, and gradient‐recalled echo T2*‐weighted). Our trained models offer dual functionality. First, they have the potential to be directly employed on clinical data. Second, they can be used as foundation models for further refinement to accommodate a range of other MRI sequences (and therefore a range of clinical scenarios which employ such sequences). This adaptation process, enabled by transfer learning, proved effective in our study across a range of MRI sequences and scan orientations, including those which differed considerably from the original training datasets. Crucially, our findings suggest that this approach remains viable even with limited data availability (as low as N = 25 for fine‐tuning), thus broadening the application of brain age estimation to more diverse clinical contexts and patient populations. By making these models publicly available, we aim to provide the scientific community with a versatile toolkit, promoting further research in brain age prediction and related areas.


| INTRODUCTION
Brain age estimation uses neuroimaging data to determine an individual's biological age and has shown potential as a biomarker of neurological health (Cole & Franke, 2017). The underlying assumption is that typical brain development and ageing processes follow predictable trajectories and that divergences from these patterns can signal neurodegenerative processes or accentuate age-related brain health issues. Such deviations are quantified in individuals by comparing their estimated brain age with their chronological age, resulting in a brain-predicted age difference (brain-PAD) (Cole et al., 2017; Smith et al., 2019). A positive brain-PAD, indicating an older-appearing brain compared to actual chronological age, has been associated with numerous neurological and psychiatric conditions (Franke & Gaser, 2019) and future health outcomes (Biondo et al., 2022; Elliot et al., 2021; Popescu et al., 2020). These findings underscore the possible value of brain age estimation as a non-invasive tool for early diagnosis, patient stratification, and monitoring of disease progression.
A key goal of brain age research is to ultimately benefit patients (Kelly et al., 2019). However, realising this goal will involve overcoming several challenges. One challenge is the lack of representativeness of research datasets (Agarwal et al., 2023; Agarwal & Wood et al., 2023; Din et al., 2023), particularly public datasets commonly used for training brain age models. This applies not only to the demographics of the study participants, but also to the nature of the MRI data (e.g. sequences and acquisition parameters) and to the data quality.
Another related challenge is training sample size, whereby smaller samples are less likely to be representative of downstream test sets, hence limiting generalisability. One option to overcome this would be to train models from scratch using local data that are more representative of the target population. However, this is not possible in many circumstances, where local data suitable for training are not routinely acquired, budgets for large-scale data collection are limited, or diseases have a low prevalence.
A promising avenue for making machine learning models more representative and generalisable is transfer learning. Put simply, the idea is to transfer what is learned from one machine learning task to another task (Chelliah et al., 2024; Zhuang et al., 2020). This type of 'domain adaptation' typically involves 'fine-tuning' the original model using a subset of labelled data from the second task. Using a pre-trained model in this way aims to benefit downstream model training speed (i.e. the time to convergence) and performance compared to training a new model from scratch, where the network weights and biases are initialised randomly. Transfer learning is an established technique in natural language processing (Devlin et al., 2018; Howard & Ruder, 2018) and computer vision (Yosinski et al., 2014) and is becoming increasingly popular in neuroimaging (Agarwal et al., 2021; Ardalan & Subbian, 2022). Transfer learning has already been applied with some success in the context of brain age (Chen et al., 2020; Jonsson et al., 2019; Leonardsen et al., 2022), showing how using pre-trained models can improve downstream prediction in, for example, specific disease groups. However, these studies all used research cohorts with high-quality MRI and only focused on a single modality (T1-weighted or diffusion-weighted).
Here, we aimed to use transfer learning to overcome some of the limitations of previous brain age studies, building on our previous work showing how convolutional neural network (CNN) models of brain age can be trained to predict age from various clinical-grade (i.e. non-volumetric) MRI modalities (Wood et al., 2022a). We trained, at scale, different brain age models for different modalities, using data from a large and clinically representative dataset, with the goal of generating a framework for transferring knowledge (i.e. pre-trained models) to a breadth of possible scenarios.
We hypothesised that we could train accurate age prediction 'baseline' models from clinical-grade scans of five different MRI sequences (T2-weighted, T2-FLAIR, T1-weighted, diffusion-weighted (DWI), and gradient-recalled echo (GRE) T2*-weighted) and that the most accurate performance could be achieved by combining predictions with an ensemble of all five models. We further hypothesised that transfer learning could be used to improve generalisability in a variety of downstream scenarios, namely out-of-sample testing on (i) data with equivalent acquisition parameters acquired at a different site, (ii) data with the same modality but a different primary acquisition plane, or (iii) data of a different modality from the baseline pre-trained model. This was done by comparing prediction performance of baseline models with no fine-tuning versus fine-tuned models or when training on the new data from scratch (i.e. no transfer learning).
Finally, we explored the necessary sample sizes required to achieve improved performance during fine-tuning.

| Datasets
All data were de-identified. The UK National Health Research Authority and Research Ethics Committee approved this retrospective study (IRAS ID 235,658, REC ID 18/YH/0458).

| Head MRI clinical datasets for brain age model development
The dataset used in this study was the same as that used in previous brain age modelling work (Wood et al., 2022a). Briefly, all 81,936 adult
The number of MRI sequences acquired during each examination in this dataset ranged from 1 to 8 (Figure A1 in Appendix A). The most frequently acquired sequence and orientation combinations were axial T2-weighted, axial DWI, coronal T2-FLAIR, sagittal T1-weighted, and axial GRE T2*-weighted images, performed in 97.2%, 78.5%, 66.1%, 43.8%, and 43.7% of examinations, respectively. We elected to develop individual 'baseline' brain age models for these five common sequences and explore transfer learning using public datasets (IXI, OASIS-3, and ADNI, described in detail below) to facilitate brain age modelling for scans that appeared with lower frequency in our study dataset (e.g. susceptibility-weighted and proton density-weighted images). Our baseline brain age models therefore serve a dual purpose. First, they can be directly used to predict brain age using the specific MRI sequence they were trained on. Second, they serve as foundation models for transfer learning, allowing further tuning on new, possibly smaller datasets, to improve generalisability performance or to adapt to a range of other MRI sequences not seen during the initial training.
A subset of 'radiologically normal for age' examinations was identified using a dedicated transformer-based neuroradiology report classifier (Wood et al., 2020, 2021; Wood et al., 2022b). This model was trained using a large dataset of neuroradiology reports from KCH (N = 5000) which had been annotated by a team of five expert neuroradiologists (UK consultant grade; US attending equivalent) as either 'radiologically normal for age' or 'radiologically abnormal for age' based on well-defined criteria (Benger et al., 2023; Wood et al., 2020; Wood et al., 2022b). Briefly, findings that could lead to a subsequent clinical intervention were labelled as 'abnormal' (a referral for case discussion at a multidisciplinary team meeting was considered the minimal intervention). Importantly, the abnormal category included findings deemed 'excessive for age' (e.g. excessive volume loss and extensive small vessel disease observed on T2-weighted images).
In this previous work, the classifier demonstrated near-perfect accuracy (area under the receiver operating characteristic curve [AUC] = 0.991) on a testing dataset of 500 radiology reports from KCH and generalised to an external testing dataset of 500 reports from GSTT (AUC = 0.990).
In the current study, a total of 22,302 examinations from the larger dataset were identified as 'radiologically normal for age' and datasets were created by randomly selecting 3500 examinations (N unique patients = 3500) from the subset that included all five MRI sequences (Figure 1). We removed any overlapping instances of patients that were present in the testing or validation datasets from the remaining pool of examinations, resulting in a training dataset of 18,890 examinations comprising different numbers of each type of MRI sequence (Table 1). This method of dividing the data ensured that (i) all baseline brain age models were tested on a dataset of the same size using the same examinations and (ii) there was no 'data leakage' (i.e. patients in the training set did not appear in the validation or testing sets).
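The patient-level split described above can be illustrated with a minimal sketch. This is not the authors' code: the record layout (`patient_id`, `exam_id`), the function name `split_by_patient`, and the toy sizes are hypothetical, chosen only to show how holding out one examination per sampled patient and then dropping every remaining examination from those patients avoids data leakage.

```python
import random


def split_by_patient(exams, n_holdout, seed=0):
    """Split (patient_id, exam_id) records so no patient spans both sets.

    Hypothetical helper: one examination is kept per held-out patient,
    and all other examinations from those patients are discarded.
    """
    rng = random.Random(seed)
    patients = sorted({pid for pid, _ in exams})
    holdout_patients = set(rng.sample(patients, n_holdout))

    holdout, seen = [], set()
    for pid, eid in exams:
        # Keep only the first examination of each held-out patient.
        if pid in holdout_patients and pid not in seen:
            holdout.append((pid, eid))
            seen.add(pid)

    # Training pool: every examination from patients NOT held out.
    train = [(pid, eid) for pid, eid in exams if pid not in holdout_patients]
    return train, holdout


# Toy data: 10 patients with 2 examinations each.
toy_exams = [(p, f"{p}-{k}") for p in range(10) for k in range(2)]
train_set, holdout_set = split_by_patient(toy_exams, n_holdout=3)
```

In the study the held-out pool would be further divided into validation and testing sets; that step is omitted here because the text does not specify how the division was made.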
Henceforth, we refer to these datasets as the 'internal clinical datasets', to distinguish them from the 'out-of-sample testing datasets' presented in Section 2.1.2.
These scans were acquired at three different UK institutions between 2005 and 2008 (Hammersmith Hospital, using a Philips 3 T system; Guy's Hospital, using a Philips 1.5 T system; and the Institute of Psychiatry, Psychology and Neuroscience, using a GE 1.

| Neuroimaging processing
We performed minimal pre-processing of raw head MRI scans.Specif- All pre-processing was carried out using open-source software: dcm2niix (Li et al., 2016) for DICOM-to-NIfTI conversion, NiBabel (Brett et al., 2020) for loading and manipulating NIfTI files, and Project MONAI (Cardoso et al., 2022) for resampling and resizing images.
To explore the application of transfer learning to allow our baseline models to generalise to research datasets which often contain

| Brain age modelling
Each baseline brain age model was based on the 'DenseNet201' architecture (Huang et al., 2017), with modifications to accommodate 3D neuroimaging data.Our network (Figure 3) consists of an initial block of 64 convolutional filters and a 'max pooling' layer, followed by four 'densely connected' convolutional blocks.Each dense block comprises alternating pointwise and volumetric convolutions which are repeated 6, 12, 48, and 32 times across the four blocks, respectively.Between each dense block are 'transition layers' which consist of a point convolution and an average pooling layer.Global average pooling is applied to the output of the final dense block, resulting in a 1920-dimensional feature vector which is converted by a fully connected layer into a prediction for the patient's age.
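The 1920-dimensional feature vector follows from simple channel bookkeeping, sketched below. The block repetitions (6, 12, 48, 32) and the 64 initial filters come from the text; the growth rate of 32 channels per dense layer and the halving of channels in each transition layer are the standard DenseNet settings (Huang et al., 2017) and are assumed here rather than stated in this paper. The function name is hypothetical.

```python
def densenet_feature_channels(init_features=64,
                              growth_rate=32,
                              block_config=(6, 12, 48, 32)):
    """Count channels entering global average pooling in a DenseNet.

    Each dense layer appends `growth_rate` channels; each transition
    layer (between blocks, not after the last one) halves the count.
    """
    channels = init_features
    for index, n_layers in enumerate(block_config):
        channels += n_layers * growth_rate
        if index < len(block_config) - 1:
            channels //= 2  # transition layer compression (assumed 0.5)
    return channels


feature_dim = densenet_feature_channels()
print(feature_dim)  # 1920
```

The intermediate counts are 256, 512, 1792, and 1920 at the output of the four dense blocks, with transitions reducing the first three to 128, 256, and 896.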
We elected to use a standard, pre-existing network, rather than
We determined the optimal weights (i.e. α_T2, α_FLAIR, α_T1, α_DWI, and α_GRE) by fitting a 5-parameter linear regression model with no intercept term using predictions obtained for the validation set. Regression modelling was performed using scikit-learn 0.24.0 (Pedregosa et al., 2011), and all other hyperparameters were set to the default values.
We evaluated the performance of our pre-trained baseline models with out-of-sample test set images in three distinct ways.
First, we examined model performance without any additional fine-    3).
When comparing the five different baseline models with one another, highly correlated pairwise predictions were seen (r ≥ .92) (Figure 6).

| Ensemble models for enhanced brain age prediction
An ensemble model which combined the predictions of each baseline model through a simple mean aggregation strategy outperformed all baseline models individually using the same internal clinical testing dataset (p < .0001) (MAE = 2.51 years, r = .97) (Figure 7; Table 4).

| Generalisability to out-of-sample testing data
Baseline models, tested on out-of-sample images with the equivalent
Figure caption (panel shown: Axial DWI, IXI): Boxplots showing baseline model generalisability for out-of-sample scans. Models were tested (i) without additional training, which we refer to as 'out-of-sample testing without transfer learning' (dotted red lines), and (ii) after applying transfer learning with a subset of the out-of-sample data serving as an additional fine-tuning training dataset, which we refer to as 'out-of-sample testing with transfer learning' (green boxes). Comparison was made with additional, architecturally identical models trained 'from scratch' (i.e. without transfer learning) using out-of-sample data exclusively, which we refer to as 'de novo out-of-sample training without transfer learning' (blue boxes). In all cases, applying transfer learning outperformed de novo out-of-sample training.
TABLE 5 Generalisability of our baseline models to out-of-sample images with sequences and orientations that were equivalent or closely related to the corresponding internal clinical training datasets.

| DISCUSSION
In this study, we have presented an accurate, robust, and generalisable deep learning framework for brain age prediction using a variety of common MRI sequences. Our results emphasise the value of training at scale using large and diverse training datasets and underscore the importance of ensemble methods and transfer learning in improving accuracy and generalisability.
Several key elements distinguish our study. First, the use of a cutting-edge, transformer-based neuroradiology report classifier enabled us to generate a large, clinically representative training dataset (Wood et al., 2020, 2021, 2022). This step successfully overcame a significant obstacle often faced in brain age model development (i.e. identifying radiologically normal scans in a large hospital dataset), resulting in a diverse and realistic set of training data that accurately represents clinical populations (Agarwal et al., 2023; Booth et al., 2023; Din et al., 2023). The diversity of our data, encompassing a range of scanner vendors, acquisition protocols, patient ethnicities, and a wide age span (18-96 years), added robustness to our models.
As a result, our baseline models demonstrated strong generalisation with out-of-sample data and formed an effective basis for further enhancements through ensemble methods and transfer learning.
or those that provide complementary information (Cole et al., 2020; Wood et al., 2019).
The successful application of transfer learning was another key aspect of our study. Transfer learning, the technique of using knowledge gained from one task to improve performance on a related but different task, has been shown to be well suited for brain age prediction (Chen et al., 2020; Jonsson et al., 2019; Leonardsen et al., 2022).
The inherent diversity of MRI data (either within the clinic or research settings), such as differences in resolution, field strength, sequence weighting, and orientations, can make it challenging to develop generalisable models.
Unlike much existing work that predominantly relies on models pre-trained on unrelated tasks, such as ImageNet for image recognition (Bashyam et al., 2020; Jiang et al., 2019; Lin et al., 2021), our study uniquely capitalises on pre-trained brain age models for fine-tuning. By doing so, we highlight the potential benefits of domain-
proves to be relevant.
In conclusion, our study presents a flexible and effective approach for brain age prediction using MRI data. By demonstrating the power of ensemble methods and transfer learning, we aim to inspire further exploration in this area and potentially others within the field of neuroradiology. Future studies should address the limitations mentioned and further validate the performance and applicability of the proposed framework in specialised clinical contexts.
By making our pre-trained models openly accessible, we hope to provide the scientific community with a versatile toolkit that can be used directly or further fine-tuned to suit the specific requirements of different clinical scenarios and MRI sequences.

(≥18 years old) head MRI examinations performed in the UK at Guy's and St Thomas' NHS Foundation Trust (GSTT) and King's College Hospital NHS Foundation Trust (KCH) between 2008 and 2019 were collected retrospectively. The MRI scans were performed using Ingenia 1.5 T (Philips Healthcare, Eindhoven, Netherlands), Aera 1.5 T (Siemens, Erlangen, Germany), Signa 1.5 T HDX (General Electric Healthcare, Chicago, USA), or Skyra 3 T (Siemens, Erlangen, Germany) scanners. The corresponding free-text radiology reports produced by 17 expert neuroradiologists were extracted from the Computerised Radiology Information System (CRIS) (Healthcare Software Systems, Mansfield, UK). These reports were predominantly unstructured, typically consisting of 5-10 sentences describing the image interpretation, along with comments regarding the patient's clinical history and recommended actions for the referring physician.

2.1.2 | External 'out-of-sample testing' datasets
To determine whether there is an improvement in baseline model generalisability following fine-tuning with transfer learning, images from three publicly accessible datasets were utilised. These datasets included equivalent MRI sequences present in the internal clinical datasets used for baseline brain age model development (Section 2.1.1), along with MRI sequences not typically acquired during routine clinical examinations. All axial T2-weighted (N = 560), axial DWI (N = 389), and volumetric T1-weighted (N = 563) scans from the Information eXtraction from Images (IXI) healthy subject dataset were obtained (Table 5 T system) and can be downloaded from https://brain-development.org/ixi-dataset/. Similarly, all axial susceptibility-weighted images (SWI) (N = 453) and axial T2-FLAIR images (N = 381) for the subset of first-visit, cognitively normal participants from the Open Access Series of Imaging Studies (OASIS-3) dataset were obtained. These scans were acquired at the Washington University Knight Alzheimer Disease Research Center using three different Siemens scanners (Vision 1.5 T, TIM Trio 3 T, and BioGraph mMR PET-MR 3 T) and can be downloaded from https://www.oasis-brains.org/. Finally, all axial proton density (PD)-weighted (N = 773), volumetric DWI (N = 101), and volumetric T2-FLAIR (N = 503) images for the subset of normal participants only from the Alzheimer's Disease Neuroimaging Initiative (ADNI-3) dataset were obtained. The scans were performed across 49 sites in the United States (see https://adni.loni.usc.edu/about/centers-cores/study-sites/), using 1.5 T Siemens, 1.5 T GE, and 1.5 T Philips scanners and can be downloaded from https://adni.loni.usc.edu/datasamples/access-data/. Figure 2 provides an overview of all the different types of head MRI scans used in this study.
ically, axial T2-weighted, axial DWI, coronal T2-FLAIR, sagittal T1-weighted, axial GRE T2*-weighted, axial SWI, axial PD-weighted, and volumetric T1-weighted images with arbitrary resolution and dimensions, stored as Digital Imaging and Communications in Medicine (DICOM) files, were converted into NIfTI format, resampled to common voxel sizes and dimensions (1.4 mm³), and then cropped or padded to achieve a uniform image size (182 mm × 182 mm × 182 mm, corresponding to a 3D array, or 'tensor', with dimensions 130 × 130 × 130). Each image's intensity was normalised by subtracting the mean and dividing by the standard deviation. Spatial registration, bias field correction, and skull-stripping were not performed.
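The geometry and normalisation steps above can be sketched in plain Python. This is an illustration only, not the study pipeline (which used dcm2niix, NiBabel, and Project MONAI): the helper names are hypothetical, the crop or pad is assumed to be centred with zero fill, and a single axis is shown for brevity, with the same logic applying to each of the three image axes.

```python
import statistics


def target_size(fov_mm=182.0, voxel_mm=1.4):
    """Number of voxels per axis for a given field of view and voxel size."""
    return round(fov_mm / voxel_mm)


def crop_or_pad_1d(values, target, fill=0.0):
    """Centre-crop or centre-pad a 1D sequence to `target` elements."""
    n = len(values)
    if n >= target:
        start = (n - target) // 2
        return list(values[start:start + target])
    before = (target - n) // 2
    after = target - n - before
    return [fill] * before + list(values) + [fill] * after


def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("constant image: standard deviation is zero")
    return [(v - mean) / std for v in values]


side = target_size()                                # 182 mm / 1.4 mm
cropped = crop_or_pad_1d(list(range(200)), side)    # longer axis: crop
padded = crop_or_pad_1d([1.0, 2.0, 3.0], 5)         # shorter axis: pad
normalised = zscore([1.0, 2.0, 3.0])
```

Whether normalisation statistics were computed before or after padding is not stated in the text, so the two steps are kept separate here.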
design a custom architecture, to ensure reproducibility and transparency of our framework. The incorporation of a global average pooling layer in DenseNet201 made it an attractive choice as it allows for the handling of images of different sizes to those encountered during training. The brain age models used in this study were adapted from the Project MONAI DenseNet201 implementation. All experiments were conducted using PyTorch 1.7.1 (Paszke et al., 2019) with two NVIDIA RTX 2080 graphics processing units (GPUs). Each baseline model was trained by minimising the L1 loss (i.e. absolute error loss) between chronological age and predicted age, with the Adam optimiser (Kingma & Ba, 2014) used to update CNN weights. The batch size was set to 14 as this was the maximum possible size using two 12-GB GPUs. The learning rate for baseline model training was initially set to 10⁻⁴ and then reduced by a factor of 2 after every five epochs without improvement on the validation set. In total, each model was trained for 100 epochs; however, checkpoints were saved after each epoch, and the model configuration with the lowest validation set loss was used for testing. In other words, early stopping was employed.
The mean absolute error (MAE), Pearson's correlation (r), and the coefficient of determination (R²) were used to quantify baseline model performance in the internal clinical test sets. Pearson's correlation was also used to quantify the pair-wise agreement between brain age predictions produced by separate baseline models for patients in the internal clinical test sets. Paired Student's t tests were used to test the statistical significance of differences in performance between the baseline models.
We also explored the use of ensemble methods to enhance the accuracy of brain age prediction in our internal clinical test sets. Two different aggregation strategies were applied to combine the predictions of individual baseline models into a single examination-level prediction of brain age. The first strategy involved a simple mean aggregation approach, whereby the predicted age was obtained by averaging the predictions from each baseline model (i.e. predicted age = [axial T2-weighted prediction + coronal T2-FLAIR prediction + sagittal T1-weighted prediction + axial DWI prediction + axial GRE T2*-weighted prediction]/5). The second strategy utilised a weighted aggregation approach, whereby the predicted age was determined by combining the predictions of each baseline model using different weights (i.e. predicted age = α_T2 * axial T2-weighted prediction + α_FLAIR * coronal T2-FLAIR prediction + α_T1 * sagittal T1-weighted prediction + α_DWI * axial DWI prediction + α_GRE * axial GRE T2*-weighted prediction).
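The two aggregation strategies can be sketched as follows. The study fitted the weights with scikit-learn; the dependency-free least-squares solver below is a stand-in chosen so the example is self-contained, and all function names and the toy numbers are hypothetical. The weights minimise the squared error between chronological age and the weighted sum of model predictions, with no intercept term.

```python
def mean_ensemble(preds):
    """Average per-model predictions; `preds` is a list of equal-length lists."""
    return [sum(column) / len(preds) for column in zip(*preds)]


def fit_weights(preds, ages):
    """No-intercept least squares via the normal equations (X^T X) a = X^T y."""
    k = len(preds)
    gram = [[sum(a * b for a, b in zip(preds[i], preds[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(p * y for p, y in zip(preds[i], ages)) for i in range(k)]
    aug = [row + [value] for row, value in zip(gram, rhs)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    weights = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(aug[r][c] * weights[c] for c in range(r + 1, k))
        weights[r] = (aug[r][k] - tail) / aug[r][r]
    return weights


def weighted_ensemble(preds, weights):
    """Weighted sum of per-model predictions for each examination."""
    return [sum(w * p for w, p in zip(weights, column)) for column in zip(*preds)]


# Toy validation data for two models (the study used five).
model_a = [30.0, 50.0, 70.0, 41.0]
model_b = [34.0, 46.0, 72.0, 39.0]
true_age = [32.0, 48.0, 71.0, 40.0]

alphas = fit_weights([model_a, model_b], true_age)
combined = weighted_ensemble([model_a, model_b], alphas)
averaged = mean_ensemble([model_a, model_b])
```

In this toy case the true ages are exactly the average of the two models, so the fitted weights come out equal and the two strategies coincide; with real predictions the weights would generally differ between sequences.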
tuning, which we refer to as 'out-of-sample testing without transfer learning'. Second, we applied transfer learning, using 80% of the out-of-sample test set data for fine-tuning and reserving 20% for testing, a process we term 'out-of-sample testing with transfer learning'. Finally, we compared these results with those obtained from architecturally identical models that were trained entirely from scratch using the external, publicly available datasets exclusively (i.e. without any transfer learning); we refer to this as 'de novo out-of-sample training without transfer learning'. Confidence intervals for the transfer learning and de novo testing approaches were generated using a fivefold cross-validation procedure, ensuring that each image was only tested once. For model fine-tuning, the initial learning rate was set to 10⁻⁵ and then reduced by a factor of 2 after every five epochs without improvement on the out-of-sample validation set. All other hyperparameters (i.e. optimiser and mini-batch size) were identical to those used during the original baseline model training. A summary of our out-of-sample testing approaches is provided in Figure 4.
The impact of our transfer learning approach was assessed in three scenarios: (i) using out-of-sample MRI sequences and orientations which matched those used to train the corresponding baseline models (e.g. fine-tuning the axial T2-weighted baseline model with out-of-sample axial T2-weighted images); (ii) using closely related sequences and orientations (e.g. fine-tuning the sagittal T1-weighted model with out-of-sample volumetric T1-weighted images); and (iii) using markedly different sequences and orientations (e.g. fine-tuning the axial T2-weighted model with out-of-sample axial susceptibility-weighted images, PD-weighted images, or even skull-stripped axial T2-weighted images). An overview of these three transfer learning scenarios is provided in Figure B1 in Appendix B.
FIGURE 2 Overview of the different types of head MRI scans used for brain age modelling in this study. A transfer learning experiment using skull-stripped data is described in Section 2.2.
FIGURE 3 DenseNet201 3D convolutional neural network architecture used in this study. Also shown are the output sizes at each internal layer of the network for an input image of size 182 mm × 182 mm × 182 mm, corresponding to an image tensor of shape 130 × 130 × 130.
To explore the influence of sample size on the baseline model fine-tuning process, we conducted sample size control experiments. We separately applied transfer learning to baseline models with varying numbers of out-of-sample scans serving as the fine-tuning training dataset (specifically 10, 25, 50, 100, 150, 200, 250, 300, and 350 scans). For each sample size, we used a consistent test set (N = 135) and generated confidence intervals by repeating the training and testing process using five separate training datasets randomly sampled from the remaining scans.
Scripts to enable readers to run and fine-tune our trained baseline models using their own MRI scans are available at https://github.com/MIDIconsortium/BrainAge.
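The design of the sample size control experiment can be sketched as below. This is a hypothetical illustration and is not taken from the authors' released scripts: the function name, the seed, and the choice to draw the five training subsets independently (so that they may overlap one another) are assumptions, since the text specifies only a fixed test set of 135 scans and five random training datasets per sample size.

```python
import random

# Fine-tuning training set sizes listed in the text.
SIZES = [10, 25, 50, 100, 150, 200, 250, 300, 350]


def sample_size_design(scan_ids, n_test=135, n_repeats=5, sizes=SIZES, seed=0):
    """Fix one test set, then draw repeated training subsets of each size."""
    rng = random.Random(seed)
    ids = list(scan_ids)
    rng.shuffle(ids)
    test_ids, pool = ids[:n_test], ids[n_test:]

    design = {}
    for size in sizes:
        if size > len(pool):
            continue  # not enough remaining scans for this size
        # Each repeat is sampled independently from the non-test pool.
        design[size] = [rng.sample(pool, size) for _ in range(n_repeats)]
    return test_ids, design


# Example with 560 scan identifiers (the number of IXI axial T2-weighted scans).
test_ids, design = sample_size_design(range(560))
```

Each entry of `design` would then drive one fine-tuning run, with the resulting test set MAE values across the five repeats giving the spread reported for that sample size.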
FIGURE 4 Overview of the three out-of-sample testing procedures used in this study. 'Out-of-sample testing without transfer learning' (left) involves the direct application of baseline models to out-of-sample (i.e. IXI, OASIS-3, or ADNI) data without any further fine-tuning. 'Out-of-sample testing with transfer learning' (middle) involves fine-tuning the baseline models, with a subset of the out-of-sample data serving as an additional training set and the remaining out-of-sample data used for testing. 'De novo out-of-sample training and testing without transfer learning' (right) involves training architecturally identical DenseNet201 models from scratch (i.e. without transfer learning), with out-of-sample data serving as the training and testing sets.
FIGURE 5 Scatter plots of predicted age versus chronological age in the internal clinical testing sets for each baseline model. Accurate age estimation was achieved for all five models (MAE ≤ 4.0 years, r ≥ .93), with the highest accuracy observed using the axial T2-weighted model (MAE = 2.85 years, r = .97).

| Brain age prediction using baseline models
All baseline models, representing the five commonest sequences and orientations in the study dataset, predicted chronological age with high accuracy in the internal clinical testing datasets (MAE ≤ 4.0 years, Pearson's correlation, r ≥ .93). The axial T2-weighted model achieved the best test set performance (MAE = 2.85 years, r = .97), followed by the coronal T2-FLAIR (MAE = 3.25 years, r = .96), sagittal T1-weighted (MAE = 3.38 years, r = .95), axial DWI (MAE = 3.55 years, r = .95), and axial GRE T2*-weighted (MAE = 4.00 years, r = .93) models (Figure 5; Table 3).
TABLE 3 Brain age prediction results for the five baseline models considered in this study using the internal clinical testing datasets.
FIGURE 6 Pair-wise scatter plots comparing brain age predictions for patients in the internal clinical testing sets using different baseline models. Strong correlation (r ≥ .92) between predictions was seen for all pairs of models. To avoid redundancy, no duplicate graphs are shown (the underlying correlation matrix is symmetric with unit diagonal as shown in Figure C1 in Appendix C).
Figure D2 in Appendix D; Table 5).
Baseline models tested on out-of-sample images with sequences that were markedly different to the corresponding internal clinical training datasets demonstrated a large reduction in generalisability when there had been no additional fine-tuning. The best individual
accurately (axial SWI scans: MAE = 20.94 years; axial PD-weighted scans: MAE = 26.54 years; axial T2-weighted scans with 1 mm³ voxels, non-brain tissue removed and spatially registered: MAE = 11.34 years). However, applying transfer learning with this baseline model resulted in substantial improvements (axial SWI scans: MAE = 4.65 years; axial PD-weighted scans: MAE = 3.92 years; axial T2-weighted scans with 1 mm³ voxels, non-brain tissue removed and spatially registered: MAE = 4.21 years). Again, transfer learning outperformed de novo out-of-sample training with architecturally identical models (axial SWI scans: MAE = 6.50 years, p < .0001; axial PD-weighted scans: MAE = 7.78 years, p < .0001; axial T2-weighted scans with 1 mm³ voxels, non-brain tissue removed and spatially registered: MAE = 4.51 years, p = .023) (Figure 9).

3.4 | Dataset size control analysis
By applying transfer learning using different fine-tuning training sample sizes, we observed that substantial improvements in age estimation can be achieved with only modest quantities of out-of-sample scans. In all three scenarios (i.e. when applied to scans matching, or similar to, or markedly different from those in the corresponding internal clinical training datasets), baseline model performance rapidly improved with as few as 25-100 out-of-sample scans and plateaued with dataset sizes greater than 200 (Figure 10).
Ensemble methods are another key component of our study.These methods integrate the strengths of multiple distinct models, thereby enhancing prediction performance and reducing individual model biases.In the context of multi-sequence brain age prediction, ensemble methods offer a unique advantage.They allow predictions from models trained on diverse MRI sequences to be combined, effectively harnessing complementary information from different sequences that may capture different aspects of brain ageing.In our study, we implemented two ensemble aggregation strategies: mean and weighted.Both strategies outperformed all individual baseline models, clearly demonstrating the value of information integration.Notably, our weighted aggregation strategy indicated that different MRI sequences contribute to prediction accuracy to varying degrees.This insight, which may reflect the differential sensitivity of various specific pre-training, which likely provides an advantage in model adaptation with knowledge directly relevant to the current task.This approach allows our models to leverage domain-specific features and patterns, improving their performance on brain age prediction.Transfer learning dataset size control analysis.Shown is the testing set MAE as a function of out-of-sample fine-tuning training dataset size for axial T 2 -weighted (top left), axial DWI (top right), volumetric T 1 -weighted (bottom left), and axial SWI images from the IXI and OASIS-3 datasets, using the axial T 2 -weighted, axial DWI, sagittal T 1 -weighted, and axial T 2 -weighted baseline models, respectively.In all cases, rapid improvement (i.e.decreased MAE) was observed using as few as 25-100 scans, with improvements plateauing using fine-tuning training datasets larger than 200 scans.In our study, transfer learning played two key roles.It improved the generalisability of our models to out-of-sample scans that closely matched the training data, and importantly, it enabled model adaptation for use 
with scans underrepresented in the training data. Furthermore, our dataset size control analysis indicated that effective fine-tuning could be achieved with a very small number of scans, in some cases as low as N = 25, thus supporting the potential utility of transfer learning for brain age prediction, even in settings with very limited data. The ability to fine-tune our models to fit various clinical scenarios and MRI sequences suggests a potential for wider application of brain age estimation. Such fine-tuned models could be used across a broad range of neurological and psychiatric disorders, as well as in healthcare settings with diverse MRI technologies and practices. Furthermore, the fact that this fine-tuning can be accomplished with limited datasets indicates the potential for extending brain age estimation to a variety of patient groups, potentially with varying demographics and varying MRI sequences.

Our study has some limitations. While our models exhibited strong generalisability across different MRI sequences, we did not evaluate their performance on more specialised sequences (e.g. perfusion imaging) or in specific clinical scenarios (e.g. 
for a given diagnosis). Future investigations should focus on evaluating the performance of the models in these specialised scenarios to validate their applicability. Additionally, the level of improvement gained through transfer learning may vary depending on the degree of similarity between the original and new tasks. Further investigation is needed to understand the extent to which transfer learning can enhance performance in different scenarios. Furthermore, the varying sizes of our baseline model training datasets might have influenced the derived weights in the ensemble model, potentially biasing the contribution of each sequence. Therefore, caution should be exercised when interpreting the weights obtained from the ensemble models, and future studies should explore methods to mitigate this potential bias if it …

[Flow chart: all adult (≥18 years) head MRI examinations performed at GSTT and KCH between 2008 and 2019; exclude examinations which do not contain any of (i) axial T2-weighted, (ii) axial DWI, (iii) coronal T2-FLAIR, (iv) sagittal T1-weighted, or (v) axial GRE T2*-weighted images; internal clinical training sets; internal clinical validation and testing sets.]

FIGURE 1 Flow chart depicting the process of creating the internal clinical datasets used in this study. To capture the diversity of examinations seen in clinical practice, we did not exclude any reported examinations on the basis of image quality.

TABLE 1 Radiologically normal for age 'internal clinical datasets' used for training and testing our baseline brain age models.

…istered, a separate processed dataset was created. We removed non-brain tissue from all 560 axial T2-weighted scans in the IXI dataset using HD-BET (Isensee et al., 2019), a publicly available deep learning-based skull-stripping tool accessible at https://github.com/MIC-DKFZ/HD-BET. The images were then resampled to a uniform voxel size (1 mm³) and aligned, via non-linear registration, to the 
MNI152 template using ANTsPy (Tustison et al., 2021). The final images measured 182 mm × 218 mm × 182 mm; the corresponding tensors had dimensions 182 × 218 × 182.
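
The preprocessing steps described above (skull-stripping, resampling to 1 mm³, and non-linear registration to MNI152) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names `grid_shape` and `preprocess_scan`, all file paths, the `hd-bet -i ... -o ...` command-line invocation, linear interpolation for resampling, and the choice of SyN as the non-linear transform are assumptions that do not appear in the text. The MNI152 template is assumed to be supplied by the user as a 1 mm NIfTI file.

```python
import subprocess
from typing import Sequence, Tuple


def grid_shape(extent_mm: Sequence[float], voxel_mm: Sequence[float]) -> Tuple[int, ...]:
    """Voxels per axis for a field of view sampled at the given voxel size."""
    return tuple(int(round(e / v)) for e, v in zip(extent_mm, voxel_mm))


def preprocess_scan(raw_path: str, stripped_path: str, template_path: str):
    """Skull-strip, resample to 1 mm isotropic, and warp one scan to a template.

    Hypothetical helper: paths, tool options, and transform type are assumptions.
    """
    import ants  # ANTsPy; imported lazily so grid_shape works without it installed

    # Skull-stripping with the HD-BET command-line tool (assumed to be on PATH).
    subprocess.run(["hd-bet", "-i", raw_path, "-o", stripped_path], check=True)

    brain = ants.image_read(stripped_path)
    # use_voxels=False: the tuple is the target spacing in mm, not a voxel count.
    # interp_type=0 selects linear interpolation (an assumed choice).
    brain = ants.resample_image(brain, (1.0, 1.0, 1.0), use_voxels=False, interp_type=0)

    template = ants.image_read(template_path)
    # The text only says "non-linear registration"; SyN is one such option in ANTsPy.
    reg = ants.registration(fixed=template, moving=brain, type_of_transform="SyN")
    return reg["warpedmovout"]  # moving image resampled onto the template grid


if __name__ == "__main__":
    # A 182 mm x 218 mm x 182 mm field of view at 1 mm voxels.
    print(grid_shape((182, 218, 182), (1, 1, 1)))  # (182, 218, 182)
```

Because the registered image is resampled onto the template grid, its array shape follows the template: at 1 mm isotropic spacing, a 182 mm × 218 mm × 182 mm field of view corresponds to a 182 × 218 × 182 tensor, which is what `grid_shape` computes.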
TABLE 2 External, publicly accessible 'out-of-sample testing datasets' utilised for transfer learning experiments. These scans were also used to generate a test set of skull-stripped, spatially normalised images (Section 2.2).